Abstract. We study the dynamics of supervised on-line learning of realizable 
tasks in feed-forward neural networks. We focus on the regime where the number 
of examples used for training is proportional to the number of input channels 
N. Using generating function techniques from spin glass theory, we are able to 
average over the composition of the training set and transform the problem for 
$N \to \infty$ to an effective single pattern system, described completely by the student 
autocovariance, the student-teacher overlap and the student response function, 
with exact closed equations. Our method applies to arbitrary learning rules, 
i.e. not necessarily of a gradient-descent type. The resulting exact macroscopic 
dynamical equations can be integrated without finite-size effects up to any degree 
of accuracy, but their main value is in providing an exact and simple starting 
point for analytical approximation schemes. Finally, we show how, in the region 
of absent anomalous response and using the hypothesis that (as in detailed balance 
systems) the short-time part of the various operators can be transformed away, 
one can describe the stationary state of the network successfully by a set of coupled 
equations involving only four scalar order parameters. 



PACS numbers: 87.10.+e, 02.50.-r, 05.20.-y 

1. Introduction 

It is now a little more than ten years since studies of the dynamics of supervised 
learning in artificial neural networks started appearing in the statistical physics 
literature. Early theoretical studies focussed on on-line learning using complete 
training sets where the probability of the same example appearing twice during 
training was zero, e.g. [1, 2, 3]. This work enabled the evaluation of properties like 
convergence speed, generalization ability and optimal learning rates. However, such 
studies were still significantly removed from real-world scenarios. The most serious 
restriction was that one had to assume the availability of an infinite amount of training 
data, homogeneously distributed over the input space. In a recent article [4] it was 
shown that even for very simple inhomogeneities the generalization error is no longer 
self-averaging and deterministic. The issue of repeating examples during training is 
technically a much harder problem and has received much attention recently. Most of 
the work has focussed on simple or linear learning rules [5, 6, 7] or different kinds of 
approximations, such as Fokker-Planck approaches [8, 9, 10, 11] and Gaussian local 
field distributions [12]. Exact work on non-linear learning rules has drawn heavily 
on techniques from the spin glass and disordered systems community (for an early 
overview of these techniques see e.g. [13]). The generating functional technique was 
used to study the dynamics of Gibbs learning in a perceptron with binary weights in 
[14, 15]. A dynamical version of the cavity method was employed in [16, 17] to 
study gradient descent batch learning and the methods of dynamical replica theory 
were applied to the problem of on-line learning in [18, 19, 20, 21]. The on-line learning 
scenario in this last sequence of papers is the one that we study here, but in the present 
paper we adapt the generating functional method à la De Dominicis to deal with 
on-line learning. This paper might be the first to present exact macroscopic equations 
for on-line learning of restricted training sets for non-linear learning rules which are 
not of a gradient-descent type. 

Precise definitions will be given in section 2, but the general setup is the following. 
The examples presented to the student perceptron are $N$-dimensional vectors chosen 
with equal probability from a fixed training set $\Omega$. The number of examples in $\Omega$ 
is $p = \alpha N$. At each presentation the student is given the teacher's classification of 
the pattern. The student can then decide to change its 'program', represented by 
the $N$-dimensional vector $\sigma \in \mathbb{R}^N$, in order to resemble more closely the teacher's 
program $\tau \in \mathbb{R}^N$. The random choice of a pattern from the training set makes the 
evolution of the student weight vector $\sigma$ a stochastic process. In section 3 we write 
down a generating function for all the possible paths of $\sigma$. This function can be averaged 
over all possible realizations of the training set (a quenched disorder average). At 
that point we will take the limit $N$ to infinity, to find saddle-point equations for a set 
of five order parameters and their conjugates. The reader who is mainly interested 
in results can skip section 3 and go directly to section 4, where the equations are 
reduced to a single exact set of three equations involving the student autocorrelation 
$C(t,t') = \sigma(t)\cdot\sigma(t')/N$, the student-teacher overlap $R(t) = \sigma(t)\cdot\tau/N$ and the student 
response function $G(t,t')$. This set gives a surprisingly simple and intuitive picture of 
the evolution of the order parameters and the distribution of the local fields. From 
that point it is easy to establish links with earlier work on infinite training sets, batch 
learning and linear learning rules. Numerical evidence is presented, showing that the 
present theory is in very good agreement with the simulations. 

In section 5, the stationary state of a student with constant weight decay is 
studied. For the stationary state one can split all relevant order parameters into 
persistent and non-persistent parts. If we keep only the persistent parts and the 
single-time non-persistent parts, we find a closed set of equations containing just four 
scalar order parameters. The procedure is inspired by a similar method applied to the 
solution of detailed balance spin glass dynamics, where it can be shown to be exact. 
Although the numerical evidence certainly seems to suggest that the procedure yields 
the correct results, we cannot prove this fact rigorously here. At the moment, it 
remains an interesting open question. 



2. Definitions 



We study on-line learning in a student perceptron characterized by a vector $\sigma \in \mathbb{R}^N$. 
The student classifies patterns $\xi \in \Omega \subset \{-1,+1\}^N$ according to $S(\xi) = \mathrm{sgn}(\sigma\cdot\xi)$. 
The student tries to learn the task set by the teacher, $T(\xi) = \mathrm{sgn}(\tau\cdot\xi)$ with $\tau \in \mathbb{R}^N$, 
i.e. we only consider linearly separable classifications. The components of the weight 
vectors of teacher and student are assumed not to scale with $N$. The set $\Omega$ contains 
only $p = \alpha N$ examples, independently chosen with equal probability from $\{-1,+1\}^N$. 
Patterns will be labeled by the Greek index $\mu$. At each iteration each pattern is 
equally likely to be chosen for presentation to the student, independently of previous 
rounds. If at step $m$, pattern $\mu(m)$ is presented to the learning student, the student's 
weight vector is slightly adjusted to converge to the desired classification according to 
a recipe of the general form: 

$$\sigma(m+1) = \sigma(m) + \frac{\eta}{\sqrt{N}}\,\xi^{\mu(m)}\,F\Big(\frac{\sigma(m)\cdot\xi^{\mu(m)}}{\sqrt{N}},\,\frac{\tau\cdot\xi^{\mu(m)}}{\sqrt{N}}\Big) \tag{1}$$



The speed of the evolution is set by the learning rate $\eta$. The function $F(x,y)$ is the 
learning rule. Popular learning rules are e.g. 

$$F(x,y) = \begin{cases} y - x, & \text{Linear} \\ \mathrm{sgn}(y), & \text{Hebb} \\ \mathrm{sgn}(y) - x, & \text{Adaline} \\ \mathrm{sgn}(y)\,\theta(-xy), & \text{Perceptron} \\ |x|\,\mathrm{sgn}(y)\,\theta(-xy), & \text{Adatron} \end{cases} \tag{2}$$

where $\theta$ is the step function, $\theta(x) = 1$ for $x > 0$ and $\theta(x) = 0$ for $x < 0$. The first three 
learning rules are all linear in $x$, while the last two only alter the student's weights 
when student and teacher disagree. 
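
The five learning rules listed above can be written down directly as functions of the student field x and the teacher field y. The following is a minimal sketch; the names `sgn`, `theta` and `LEARNING_RULES` are illustrative, and the values chosen at exactly zero argument are our own convention, since the text leaves them open.

```python
def sgn(v: float) -> float:
    """Sign function; sgn(0) is set to 0 here (our convention)."""
    return float((v > 0) - (v < 0))


def theta(v: float) -> float:
    """Step function: 1 for v > 0, 0 for v < 0 (0 at v == 0 by our convention)."""
    return 1.0 if v > 0 else 0.0


# Each rule maps (student field x, teacher field y) to the update strength F(x, y).
LEARNING_RULES = {
    "linear":     lambda x, y: y - x,
    "hebb":       lambda x, y: sgn(y),
    "adaline":    lambda x, y: sgn(y) - x,
    "perceptron": lambda x, y: sgn(y) * theta(-x * y),
    "adatron":    lambda x, y: abs(x) * sgn(y) * theta(-x * y),
}
```

The perceptron and adatron entries return zero whenever x and y have the same sign, which is the statement that these rules only act on misclassified patterns.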

A theoretical study of perceptrons can be useful for predicting learning times, 
for evaluating different learning rules or for finding optimal learning rates. For this 
purpose one is not so much interested in predicting the specific microscopic realizations 
of $\sigma$ over time, but rather in the fraction of errors the perceptron makes in the 
classification of the training set (training error, $E_t$) and the fraction of errors in the 
classification of the complete set of examples $\{-1,+1\}^N$ (generalization error, $E_g$): 

$$E_t(\sigma) = \langle\theta(-(\sigma\cdot\xi)(\tau\cdot\xi))\rangle_\Omega = \frac{1}{p}\sum_{\xi\in\Omega}\theta(-(\sigma\cdot\xi)(\tau\cdot\xi)), \tag{3}$$

$$E_g(\sigma) = \langle\theta(-(\sigma\cdot\xi)(\tau\cdot\xi))\rangle = \frac{1}{2^N}\sum_{\xi\in\{-1,+1\}^N}\theta(-(\sigma\cdot\xi)(\tau\cdot\xi)). \tag{4}$$

Given $\sigma$, the generalization error is independent of the training set. It is in fact a 
standard result in perceptron theory that this error depends only on the angle 
between student and teacher vector, i.e. on the norm of $\sigma$ and its overlap with $\tau$: 

$$E_g(\sigma) = \frac{1}{\pi}\arccos\Big(\frac{\sigma\cdot\tau}{|\sigma|\,|\tau|}\Big). \tag{5}$$
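
To make the definitions of this section concrete, the sketch below simulates the microscopic on-line process for the Hebbian rule on a restricted training set and measures both errors. It is an illustration only: the update prefactor eta/sqrt(N), the Gaussian teacher, the zero initial student and all function names are assumptions made for this example, and the generalization error is evaluated from the angle formula rather than by enumerating all 2^N inputs.

```python
import math
import random


def generalization_error(sigma, tau):
    """E_g = arccos(cos(angle between student and teacher)) / pi."""
    dot = sum(s * t for s, t in zip(sigma, tau))
    norm = math.sqrt(sum(s * s for s in sigma) * sum(t * t for t in tau))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.acos(cos_angle) / math.pi


def training_error(sigma, tau, patterns):
    """Fraction of training patterns on which student and teacher disagree."""
    errors = 0
    for xi in patterns:
        x = sum(s * v for s, v in zip(sigma, xi))
        y = sum(t * v for t, v in zip(tau, xi))
        if x * y < 0:
            errors += 1
    return errors / len(patterns)


def simulate_hebb(n=100, alpha=1.0, eta=1.0, sweeps=20, seed=0):
    """On-line Hebbian learning from p = alpha*N fixed random patterns."""
    rng = random.Random(seed)
    tau = [rng.gauss(0.0, 1.0) for _ in range(n)]            # assumed teacher
    patterns = [[rng.choice((-1, 1)) for _ in range(n)]
                for _ in range(int(alpha * n))]
    sigma = [0.0] * n                                        # assumed start
    scale = eta / math.sqrt(n)
    for _ in range(sweeps * n):
        xi = rng.choice(patterns)            # each pattern equally likely
        y = sum(t * v for t, v in zip(tau, xi)) / math.sqrt(n)
        f = 1.0 if y > 0 else -1.0           # Hebb rule: F(x, y) = sgn(y)
        sigma = [s + scale * v * f for s, v in zip(sigma, xi)]
    return training_error(sigma, tau, patterns), generalization_error(sigma, tau)
```

Calling `simulate_hebb()` returns the pair (training error, generalization error) for one realization of the training set; other rules can be tried by replacing the line that computes `f`.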
3. The generating functional 

The random choice of a pattern $\mu(m)$ makes it more convenient to go to a description 
of an ensemble of students with a distribution of weight vectors, $P_m(\sigma)$, than to study 
the stochastic evolution of $\sigma$ directly. In this setting we can study the (moment) 
generating function $Z$ for iteration times up to $M$: 

$$Z[\psi] = \int\! D\sigma\; P(\sigma(0),\sigma(1),\ldots,\sigma(M))\; e^{i\sum_{m=0}^{M}\psi(m)\cdot\sigma(m)}, \tag{6}$$

where $\int D\sigma = \int \prod_m d\sigma(m)$ is an integral over all possible paths the students could 
take. Differentiation of $Z$ with respect to $\psi$ generates all moments of the distribution $P$. 
The microscopic dynamics of weight vectors at time $m$ can be written in the general 
form $P_{m+1}(\sigma) = \int d\sigma'\, W(\sigma|\sigma')P_m(\sigma')$, with the transition probabilities 

$$W(\sigma|\sigma') = \frac{1}{p}\sum_{\mu=1}^{p}\delta\Big[\sigma - \sigma' - \frac{\eta}{\sqrt{N}}\,\xi^\mu F\Big(\frac{\sigma'\cdot\xi^\mu}{\sqrt{N}},\frac{\tau\cdot\xi^\mu}{\sqrt{N}}\Big)\Big] \tag{7}$$



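To disentangle the double $\xi$ dependence of the transition rates, we employ the integral 
representation of the Dirac delta-function and introduce 

$$W(\sigma|\sigma') = \frac{1}{p}\sum_\mu \int\!\frac{d\hat\sigma}{(2\pi)^N}\,\exp\Big[i\hat\sigma\cdot\Big(\sigma - \sigma' - \frac{\eta}{\sqrt{N}}\,\xi^\mu F(x'^\mu, y^\mu)\Big)\Big] \tag{8}$$

$$\phantom{W(\sigma|\sigma')} = \int\!\frac{d\hat\sigma}{(2\pi)^N}\,\exp[i\hat\sigma\cdot(\sigma - \sigma')]\;\hat W(\hat\sigma|x', y, w), \tag{9}$$

where we introduced three shorthands, called local fields (in analogy with spin systems) 

$$x^\mu = \frac{\sigma\cdot\xi^\mu}{\sqrt{N}},\qquad y^\mu = \frac{\tau\cdot\xi^\mu}{\sqrt{N}},\qquad w^\mu = \frac{\hat\sigma\cdot\xi^\mu}{\sqrt{N}}, \tag{10}$$

and the Fourier transform of the transition rate $\hat W$. For large $N$, $\hat W$ will be of order 
$1 + O(N^{-1/2})$ and will therefore factorize over the patterns 

$$\hat W(\hat\sigma|x,y,w) = \frac{1}{p}\sum_\mu \exp\big(-i\eta w^\mu F(x^\mu, y^\mu)\big) = \exp\Big[\frac{1}{p}\sum_\mu\Big(e^{-i\eta w^\mu F(x^\mu,y^\mu)} - 1\Big)\Big] + \ldots \tag{11}$$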

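We can now rewrite the generating function: 

$$Z[\psi] = \int\! D\sigma\; P_0(\sigma(0)) \prod_{m=0}^{M-1} W(\sigma(m+1)|\sigma(m))\; e^{i\sum_m \psi(m)\cdot\sigma(m)}$$

$$= \int\! D\sigma D\hat\sigma\, Dx\, dy\, Dw\; P_0(\sigma(0))\,\Gamma[y,x,w,\sigma] \prod_m \hat W(\hat\sigma(m)|x(m),y,w(m))$$

$$\times \prod_m \exp\big[i\hat\sigma(m)\cdot(\sigma(m+1) - \sigma(m)) + i\psi(m)\cdot\sigma(m)\big]$$

where the appearance of the training examples is restricted to the function $\Gamma$, given 
by: 

$$\Gamma[y,x,w,\sigma] = \prod_\mu\Big\{\delta\Big[y^\mu - \frac{\tau\cdot\xi^\mu}{\sqrt{N}}\Big]\prod_m \delta\Big[x^\mu(m) - \frac{\sigma(m)\cdot\xi^\mu}{\sqrt{N}}\Big]\,\delta\Big[w^\mu(m) - \frac{\hat\sigma(m)\cdot\xi^\mu}{\sqrt{N}}\Big]\Big\}$$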

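In the thermodynamic limit ($N \to \infty$), all the macroscopic observables in this model 
are self-averaging with respect to the realization of the training set. To avoid the 
difficulty of choosing a typical training set, we can thus safely consider the disorder 
averaged generating function $[Z]_{\rm dis}$. The only term involving the actual patterns is $\Gamma$. 
The quenched disorder average of $\Gamma$ is (factors of $2\pi$ being absorbed into the measures) 

$$[\Gamma]_{\rm dis} = \int\! d\hat y\, D\hat x\, D\hat w\; \prod_\mu \exp\Big[i\hat y^\mu y^\mu + i\sum_m \hat x^\mu(m)x^\mu(m) + i\sum_m \hat w^\mu(m)w^\mu(m)\Big]$$

$$\times \prod_\mu 2^{-N}\!\!\sum_{\xi^\mu\in\{\pm1\}^N}\exp\Big[-\frac{i}{\sqrt{N}}\,\xi^\mu\cdot\Big(\hat y^\mu\tau + \sum_m \hat x^\mu(m)\sigma(m) + \sum_m \hat w^\mu(m)\hat\sigma(m)\Big)\Big]$$

Of the term on the second line, only the quadratic terms in $\tau$, $\sigma$ and $\hat\sigma$ survive in the 
thermodynamic limit. Near this limit we find that this term containing the training 
patterns becomes 

$$\prod_\mu \exp\Big[-\frac{1}{2N}\Big(\hat y^\mu\tau + \sum_m \hat x^\mu(m)\sigma(m) + \sum_m \hat w^\mu(m)\hat\sigma(m)\Big)^2\Big]$$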

We assume that the initial probability distribution $P_0(\sigma)$ 
factorizes over sites. Full factorization of the generating function over patterns and 
input channels can then be achieved if we introduce the following order parameters 
and their conjugates via delta-functions: 

$$R_m = \frac{1}{N}\sum_i \sigma_i(m)\tau_i,\qquad r_m = \frac{1}{N}\sum_i \hat\sigma_i(m)\tau_i,$$

$$C_{mn} = \frac{1}{N}\sum_i \sigma_i(m)\sigma_i(n),\qquad K_{mn} = \frac{1}{N}\sum_i \sigma_i(m)\hat\sigma_i(n),\qquad c_{mn} = \frac{1}{N}\sum_i \hat\sigma_i(m)\hat\sigma_i(n).$$

When changing $m$ to $m+1$, the expectation of these order parameters can only change 
by an amount of order $N^{-1}$. We thus rescale the time as $t = m/(\Delta N)$. From here, one 
could go to a continuous time description by introducing the real time $\Delta t$ and taking the 
limit $\Delta$ to zero, but we delay this step in order to avoid technical difficulties in evaluating 
the path integrals. The generating function attains a form suitable for saddle-point 
integration: 

$$[Z]_{\rm dis} \propto \int \ldots \exp\big[N(\Psi + \Phi + \Omega)\big] \tag{12}$$



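There are three distinct leading order contributions to the exponent. The first is a 
'bookkeeping' term, linking the order parameters to their conjugates: 

$$\Psi = i\hat R\cdot R + i\hat r\cdot r + i\,\mathrm{Tr}\big(\hat C^{T}C + \hat K^{T}K + \hat c^{T}c\big) \tag{13}$$

The second term reflects the coupled dynamics of the local fields: 

$$\Phi = \alpha\log\int\!\frac{dy\,d\hat y}{2\pi}\prod_t\Big[\frac{dx_t\,d\hat x_t}{2\pi}\frac{dw_t\,d\hat w_t}{2\pi}\Big]\exp\Big[\frac{\Delta}{\alpha}\sum_t\big(e^{-i\eta w_tF(x_t,y)} - 1\big) + i\hat w\cdot w$$

$$\qquad + i\hat y(y - \theta_y) + i\hat x\cdot(x - \theta_x) - \tfrac12\hat y^2 - \tfrac12\hat xC\hat x - \hat xK\hat w - \hat y\,R\cdot\hat x - \tfrac12\hat wc\hat w - \hat y\,r\cdot\hat w\Big] \tag{14}$$

where we have added additional sources $\theta_x$ and $\theta_y$ to couple to $x$ and $y$. These sources 
act as biases of student and teacher, respectively. The third term describes the evolution 
of the now decoupled weight components: 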

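$$\Omega = \frac{1}{N}\sum_i \log\int\!\prod_t\frac{d\sigma_t\,d\hat\sigma_t}{2\pi}\;P_0(\sigma_0)\exp\Big[i\hat\sigma\cdot G_0^{-1}\sigma - i\tau_i\hat R\cdot\sigma - i\tau_i\hat r\cdot\hat\sigma$$

$$\qquad - i\sigma\hat C\sigma - i\sigma\hat K\hat\sigma - i\hat\sigma\hat c\hat\sigma - i\hat\sigma\cdot\theta_i + i\sigma\cdot\psi_i\Big] \tag{15}$$

where $[G_0^{-1}]_{tt'} = \delta_{t+1,t'} - \delta_{tt'}$ and where we have included an external driving force 
$\theta_i(t)$ in the system. With a modest amount of foresight we write $G_{tt'} = -iK_{tt'}$. Upon 
taking derivatives with respect to the generating fields $\{\psi_i(t), \theta_i(t)\}$, we find at the 
relevant saddle-point: 

$$R_t = \lim_{N\to\infty}\frac{1}{N}\sum_i \tau_i\,[\langle\sigma_i(t)\rangle]_{\rm dis},\qquad C_{tt'} = \lim_{N\to\infty}\frac{1}{N}\sum_i [\langle\sigma_i(t)\sigma_i(t')\rangle]_{\rm dis},$$

$$G_{tt'} = \lim_{N\to\infty}\frac{1}{N}\sum_i \frac{\partial}{\partial\theta_i(t')}[\langle\sigma_i(t)\rangle]_{\rm dis}.$$

Using the built-in normalisation $[Z(0)]_{\rm dis} = 1$, we also find 

$$r_t = \lim_{N\to\infty}\frac{i}{N}\sum_i \tau_i\frac{\partial}{\partial\theta_i(t)}[Z(0)]_{\rm dis} = 0,\qquad c_{tt'} = -\lim_{N\to\infty}\frac{1}{N}\sum_i\frac{\partial^2}{\partial\theta_i(t)\,\partial\theta_i(t')}[Z(0)]_{\rm dis} = 0.$$

If we perform the saddle-point integration, we find in addition that the conjugates 
$\hat R$ and $\hat C$ are proportional to derivatives of the normalised $[Z(0)]_{\rm dis}$ with respect 
to the fields $\theta_x$ and $\theta_y$, so that 

$$\hat R_t = 0,\qquad \hat C_{tt'} = 0.$$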



At this point we can already simplify (or remove altogether) the generating fields: 
$\theta_i(t) = \theta_t$, $\theta^\mu_x(t) = \theta_{x,t}$, $\theta^\mu_y = \theta_y$ and $\psi_i(t) = 0$. The external fields $\theta_x$ and $\theta_y$ can be 
interpreted as biases or thresholds of the student and teacher, respectively. Without 
loss of generality we may set $\tau_i = 1$. The evolution of the local fields and the weight 
vector are now linked only via the remaining non-zero order parameters. We proceed 
to evaluate the two separate processes at the saddle-point. 



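3.1. Pattern average $\Phi$ 

Focussing on the evaluation of the pattern average $\Phi$ we find that the terms involving 
$w$ can be interpreted as averages over a Poisson distribution: 

$$\int\!\frac{dw_t}{2\pi}\exp\Big[i\hat w_tw_t + \frac{\Delta}{\alpha}\big(e^{-i\eta w_tF(x_t,y)} - 1\big)\Big] = \sum_{k_t=0}^{\infty}\mathbb{P}(k_t)\int\!\frac{dw_t}{2\pi}\exp\big[i\hat w_tw_t - i\eta k_tw_tF(x_t,y)\big]$$

$$= \sum_{k_t=0}^{\infty}\delta\big(\hat w_t - \eta k_tF(x_t,y)\big)\,\mathbb{P}(k_t)$$

where $\mathbb{P}(k)$ is a Poisson distribution with average $\Delta/\alpha$. For $\Delta N \gg 1$, $\mathbb{P}(k)$ gives the 
probability that a specific pattern is presented $k$ times to the student in a time interval 
$\Delta$. Writing $\hat G_{tt'} = i\hat K_{tt'}$, the saddle-point equations of the remaining non-zero order 
parameters are found to be: 

$$\hat r_t = \alpha\frac{\partial}{\partial\theta_y}\langle f_t\rangle_\Phi,\qquad \Lambda_{tt'} \equiv 2i\hat c_{tt'} = \alpha\langle f_tf_{t'}\rangle_\Phi,\qquad i\hat G_{tt'} = -\alpha\frac{\partial}{\partial\theta_{x,t}}\langle f_{t'}\rangle_\Phi, \tag{16}$$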
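with the shorthand $f_t = \eta k_tF(x_t,y)$. The average $\langle\cdot\rangle_\Phi$ is taken with the measure implied 
by equation (14). Performing the disorder average has turned the $\hat y$ integral into a 
Gaussian one. Evaluating this integration yields: 

$$\Phi = \alpha\log\int\!\frac{dy}{\sqrt{2\pi}}\prod_t\Big[\frac{dx_t\,d\hat x_t}{2\pi}\sum_{k_t}\mathbb{P}(k_t)\Big]\exp\Big[-\tfrac12(y - \theta_y)^2 - \tfrac12\hat xD\hat x + i\hat x\cdot\big(x - \theta_x - Gf - R(y - \theta_y)\big)\Big] \tag{17}$$

where we have introduced the student autocovariance $D_{tt'} = C_{tt'} - R_tR_{t'}$. We note 
the operator identity $\partial/\partial\theta_y \simeq y - R\cdot\partial/\partial\theta_x$, which in turn implies using (16) that 

$$\hat r_t = \alpha\langle yf_t\rangle_\Phi + i\sum_{t'}R_{t'}\hat G_{t't} \tag{18}$$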



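3.2. Weight component average $\Omega$ 

The saddle-point equations involving the weight vectors are: 

$$R_t = \langle\sigma_t\rangle_\Omega,\qquad C_{tt'} = \langle\sigma_t\sigma_{t'}\rangle_\Omega,\qquad G_{tt'} = \frac{\partial}{\partial\theta_{t'}}\langle\sigma_t\rangle_\Omega, \tag{19}$$

where $\langle\cdot\rangle_\Omega$ is an average with the measure induced by (15). This measure can be 
generated by the stochastic process $-\hat r + i\hat G^{T}\sigma + G_0^{-1}\sigma - \theta - \rho = 0$, where $\rho_t$ is 
a Gaussian noise with zero mean and covariance $\langle\rho_t\rho_{t'}\rangle = \Lambda_{tt'} = 2i\hat c_{tt'}$. From this 
process, we find a simple expression for $\sigma$ (upon setting $\theta = 0$): 

$$\sigma = G(\hat r + \rho), \tag{20}$$

with the response, student-teacher overlap and student autocovariance given by 

$$G = \big[G_0^{-1} + i\hat G^{T}\big]^{-1},\qquad R = G\hat r,\qquad D = G\Lambda G^{T}. \tag{21}$$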

4. Effective single pattern process 

Upon combining the results of the previous two subsections, we find a closed set of 
exact equations relating the evolution of $R$, $D$ and $G$ to the evolution of the local 
field distribution implied by the measure in (17). Setting $\theta_y = 0$ in this equation, 
we find that the distribution is generated by the following stochastic process for a 
student-pattern overlap: 

$$x_t = R_ty + z_t + \theta_{x,t} + \sum_{t'}G_{tt'}f_{t'}, \tag{22}$$

where $y$ and $z_t$ are independent Gaussian random variables with zero mean and 
(co)variances $\langle y^2\rangle = 1$ and $\langle z_tz_{t'}\rangle = D_{tt'}$. In general $x_t$ will depend on previous values of 
$x$ via the term $[Gf]_t = \eta\sum_{t'}G_{tt'}k_{t'}F(x_{t'},y)$. 

Figure 1. The evolution of the generalization error (upper lines with diamonds) and 
training error (lower lines with squares) for the perceptron (left) and adatron 
(right) learning rules for various training set sizes. The lines correspond to 
(generalization error: top to bottom, training error: bottom to top) $\alpha = 
0.25, 0.5, 1, 2, 4$. The markers correspond to single run simulations ($N = 6000$) with 
no decay and learning rates $\eta = 1$ (perceptron) and $\eta = 1.5$ (adatron). The solid 
lines are the results of numerical calculations of the effective single pattern process 
with $M = 20{,}000$ and time step $\Delta = 0.05$. 

The evolution of the order parameters is given using the bare propagator $G_0$. 
To ensure that the students' weight distribution will eventually reach a stationary 
state, we let the weights decay with rate $\gamma$. The bare propagator then takes the form 
$[G_0^{-1}]_{tt'} = \delta_{t+1,t'} - \lambda\delta_{tt'}$, where $\lambda = 1 - \Delta\gamma$. In the limit $\Delta \to 0$, this corresponds to 
$[G_0]_{tt'} = \theta(t - t' - \Delta/2)\exp[-\Delta\gamma(t - t' - 1)]$. The equations (21) and (16) determine 
the evolution of the response function: 

$$[G_0^{-1}G]_{tt'} = \delta_{tt'} + \alpha\sum_s\Big\langle\frac{\partial f_t}{\partial\theta_{x,s}}\Big\rangle G_{st'}, \tag{23}$$

where now $\langle\cdot\rangle$ is the Gaussian average over $y$ and all $z_t$'s. Using equation (21) along 
with the relation (18), we find: 



$$[G_0^{-1}R]_t = \hat r_t - i\sum_s \hat G^{T}_{ts}R_s = \alpha\langle yf_t\rangle. \tag{24}$$

The combination of equations (21) and (16) gives the evolution of $D$: 

$$[G_0^{-1}D]_{tt'} = \alpha\langle f_t[Gf + z]_{t'}\rangle = \alpha\langle f_t(x_{t'} - R_{t'}y)\rangle, \tag{25}$$

where we have set $\theta_x$ to zero. The evolution of the diagonal terms $Q_t = D_{tt} + R_t^2$ 
can be implicitly calculated using this equation, but the distribution of $x_{t+1}$ has to 
be known before $Q_{t+1}$ can be calculated. To avoid this difficulty, we use equation 
(25) together with the scaling arguments $G_{tt'} = O(\Delta^0)$ for $t > t'$, $G_{tt} = O(\Delta)$ and 
$z_{t+1} - \lambda z_t = O(\Delta^{1/2})$ to determine for small $\Delta$ the evolution of $Q_t$ more explicitly: 

$$Q_{t+1} = \lambda^2Q_t + \alpha\lambda\langle f_tx_t\rangle + \alpha\langle f_tx_{t+1}\rangle$$

$$= \lambda^2Q_t + 2\alpha\lambda\langle f_tx_t\rangle + \alpha\langle f_t([z + Gf]_{t+1} - \lambda[z + Gf]_t)\rangle$$

$$= \lambda^2Q_t + 2\alpha\lambda\langle f_tx_t\rangle + \alpha\langle f_t^2\rangle + O(\Delta^{3/2}) \qquad (\Delta \to 0) \tag{26}$$

The equations (23)-(26) could perhaps have been found using less sophisticated 
methods, but the strength of the generating functional method is that it is also 
capable of producing the joint local field distribution $P(x,y)$ generated by (22). The 
generalization error is a direct function of these order parameters, while the training 
error is a slave of the local field distribution governed by them: 

$$E_g(t) = \frac{1}{\pi}\arccos\Big(\frac{R_t}{\sqrt{Q_t}}\Big),\qquad E_t(t) = \langle\theta(-x_ty)\rangle \tag{27}$$

The evolution of the order parameters can be calculated numerically by a Monte Carlo 
procedure similar to the single spin procedure outlined in [?]. The general idea is to 
follow the evolution of $M$ pattern overlaps. For each of these patterns, one generates 
at time $t = 0$ a teacher overlap $y$ from the standard normal distribution. Time is 
discretized with unit $\Delta$. At each time step and for each pattern, one generates the 
Gaussian noise $z_t$, correlated with the previous noise values for that particular 
pattern, and a Poissonian random variable $k_t$. Averages over all patterns are Monte 
Carlo implementations of the averages occurring in the evolution equations for $D$, $R$ 
and $G$. By increasing $M$ and decreasing $\Delta$ the evolution of the $N \to \infty$ perceptron 
can be calculated up to arbitrary precision. This is shown for various $\alpha$ in figure 1 
with $M = 20{,}000$ and $\Delta = 0.05$. The figures illustrate that the agreement of the 
theory with the simulations is quite satisfactory. 
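
The Poissonian variable k_t used in this procedure counts how often one given pattern is presented during a time interval of length Δ. As a hedged numerical check of that statement, the sketch below compares the exact binomial probability (ΔN draws, each hitting a given pattern with probability 1/(αN)) with the Poisson distribution of mean Δ/α. The function names and the parameter values used in the usage note are illustrative assumptions.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Poisson probability of k events for the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def presentation_pmf(k: int, n_inputs: int, alpha: float, delta: float) -> float:
    """Exact probability that one fixed pattern is drawn k times.

    During one time step of size delta there are delta*N presentations,
    each picking one of the p = alpha*N patterns uniformly at random.
    """
    draws = round(delta * n_inputs)
    p_single = 1.0 / round(alpha * n_inputs)
    return math.comb(draws, k) * p_single ** k * (1.0 - p_single) ** (draws - k)
```

For example, with N = 10000, α = 1 and Δ = 0.05 the two distributions agree closely for the first few values of k, and the agreement improves further as N grows.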



4.1. Batch learning 

So far, we have treated only the case of on-line learning. This is the most widely 
applied learning scenario, but much of the analytical work on learning with restricted 
training sets has been devoted to off-line or batch learning. In batch learning one first 
calculates the average effect of learning (a large sample of) the entire training set, 
before making a weight update. For small learning rates, batch and on-line learning 
ought to generate the same macroscopic flow. For completeness we discuss here what 
changes when we switch from an on-line to a batch scenario. The effect on the theory 
as presented above is the disappearance of the extra noise term $\langle f_t^2\rangle$ in the evolution 
of $Q$ in equation (26) and the replacement of the Poisson variable $k_t$ by its average 
$\Delta/\alpha$. The intuition behind the first change is that big changes in the student weight 
vector can no longer happen after a single pattern is presented; the weights undergo 
a much smoother evolution due to the averaging of the update over all patterns. As 
a result of the second change, the student training pattern overlap becomes: 

$$x_t = R_ty + z_t + \eta\frac{\Delta}{\alpha}\sum_sG_{ts}F(x_s,y) \tag{28}$$

This equation was derived earlier in the context of gradient-descent batch learning 
by Wong et al. using an elegant application of the dynamical cavity method [?]. 
Again, the reason for the change in a training pattern overlap $x_t$ is that, instead of 
undergoing big changes when that particular pattern is presented $k_t$ times to the student 
in time interval $(\Delta t, \Delta(t+1))$, $x_t$ now feels the average effect of the influence of the 
pattern during such an interval. These are the only changes necessary in the present 
analysis when switching from on-line to batch learning. 



4.2. Linear learning rules 

The average occurring in the evolution of the response function $G$ in (23) can be 
explicitly calculated if the student is using a learning rule that is linear in $x$, e.g. the 
linear, Hebbian or adaline rules. For rules of the form $F(x,y) = g(y) - cx$, 
we find: 

$$\frac{\partial f_t}{\partial\theta_{x,t'}} = -\eta ck_t\Big(\delta_{tt'} + \sum_sG_{ts}\frac{\partial f_s}{\partial\theta_{x,t'}}\Big) \tag{29}$$

The causality of $G$ allows us to perform the Poisson averages and a little matrix algebra 
leads to 

$$G_0^{-1}G = I - \Delta c\eta\Big[I + c\eta\frac{\Delta}{\alpha}G\Big]^{-1}G \tag{30}$$

The resulting response is translation invariant, i.e. $G_{tt'} = G(\Delta(t - t'))$ for $t > t'$. 
The on-line response found here for linear learning rules agrees with the batch results 
found for the linear rule in [?] and the adaline rule in [?]. The Fourier transform of 
the previous relation reads 

$$G^{-1}(\omega) = \gamma - i\omega + \frac{c\eta}{1 + c\eta\,G(\omega)/\alpha} \tag{31}$$

This equation is analysed in [?]. For $\gamma = 0$ and for $c \neq 0$, a transition in the behaviour 
of the response takes place at $\alpha_c = 1$. This position is identical for on-line and 
off-line learning. The nature of this transition is easily understood. Without decay, 
the evolution of the weight vector is confined to the linear subspace spanned by the 
patterns in the training set. Below $\alpha_c = 1$, the random patterns are unlikely to 
span the whole $N$-dimensional space, resulting in a non-decaying part of the response 
function. This argument is valid for general rules without decay. 
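
The transition at α = 1 can be made tangible by looking at the zero-frequency response for linear rules. The sketch below assumes that the frequency-domain relation for the response reads 1/G(0) = γ + cη / (1 + cη G(0)/α) at ω = 0, which is our reading of the relation above; under that assumption G(0) is the positive root of a quadratic. The function name `static_response` is hypothetical.

```python
import math


def static_response(alpha: float, c: float, eta: float, gamma: float) -> float:
    """Zero-frequency response G(0) for a linear rule F(x, y) = g(y) - c*x.

    Assumes 1/G = gamma + a / (1 + a*G/alpha) with a = c*eta, c > 0, which is
    equivalent to (gamma*a/alpha) G^2 + (gamma + a - a/alpha) G - 1 = 0.
    Returns math.inf when no finite positive solution exists.
    """
    a = c * eta
    quad = gamma * a / alpha           # coefficient of G^2
    lin = gamma + a - a / alpha        # coefficient of G
    # Positive root written in a cancellation-free form: 2 / (lin + sqrt(...)).
    denom = lin + math.sqrt(lin * lin + 4.0 * quad)
    return math.inf if denom <= 0.0 else 2.0 / denom
```

Without decay (γ = 0) this gives a finite value α / (cη(α - 1)) only for α > 1 and diverges for α ≤ 1, while any positive decay rate keeps the integrated response finite on both sides of the transition.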

The student overlap with a particular pattern can also be written in a more 
explicit way: 

$$x = [I + c\eta\,GK]^{-1}\big(Ry + z + \eta\,g(y)\,Gk\big),\qquad K_{tt'} = k_t\delta_{tt'} \tag{32}$$

The final results are rather cumbersome, but all the averages appearing in the evolution 
of the order parameters involving $k$ and $z$ can be done without any problems. The only 
remaining integrals are of the form $\langle g(y)y\rangle$ and $\langle g(y)^2\rangle$ with the standard Gaussian 
measure. 



4.3. Infinite training sets 

To compare our results to the well-known unrestricted training set results, we take 
the limit $\alpha \to \infty$. In this case the probability of repeating an example is zero. This is 
reflected in the fact that $\langle k\rangle \to 0$ as $\alpha \to \infty$. Given $y$, the local fields $x$ are random 
variables given by: 

$$x_t = yR_t + z_t, \tag{33}$$

or, equivalently, $x_t$ is a Gaussian random variable with mean $yR_t$ and covariance $D_{tt'}$. 
The effects of the retarded self-interaction caused by $G$ thus completely vanish. If we 
go to a continuum time description, we recover equations found in e.g. [?]. The 
evolution of the student-teacher overlap and the student self-overlap are given by 

$$\frac{dR}{dt} = -\gamma R + \eta\langle yF(x,y)\rangle_t \tag{34}$$

$$\frac{dQ}{dt} = -2\gamma Q + 2\eta\langle xF(x,y)\rangle_t + \eta^2\langle F(x,y)^2\rangle_t, \tag{35}$$

with the Gaussian single-time average defined by $\langle x\rangle_t = 0$, $\langle y\rangle_t = 0$, $\langle x^2\rangle_t = Q_t$, $\langle y^2\rangle_t = 1$, 
$\langle xy\rangle_t = R_t$. 
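
For the Hebbian rule F(x, y) = sgn(y) the Gaussian averages in the infinite training set flow can be done by hand: ⟨y sgn(y)⟩ = √(2/π), ⟨x sgn(y)⟩ = R √(2/π) and ⟨F²⟩ = 1. The sketch below integrates the resulting pair of equations, dR/dt = -γR + η√(2/π) and dQ/dt = -2γQ + 2ηR√(2/π) + η², with a simple Euler scheme; this form of the flow is our reading of the equations above, and the function names, step size and initial condition R = Q = 0 are assumptions made for illustration.

```python
import math


def integrate_hebb_infinite(eta=1.0, gamma=0.1, t_max=200.0, dt=0.01):
    """Euler integration of the (R, Q) flow for Hebbian learning, alpha -> infinity."""
    a = math.sqrt(2.0 / math.pi)       # <|y|> for a standard Gaussian y
    r, q = 0.0, 0.0                    # assumed initial condition: sigma = 0
    for _ in range(int(round(t_max / dt))):
        dr = -gamma * r + eta * a
        dq = -2.0 * gamma * q + 2.0 * eta * a * r + eta ** 2
        r, q = r + dt * dr, q + dt * dq
    return r, q


def stationary_hebb_infinite(eta=1.0, gamma=0.1):
    """Fixed point of the same flow, obtained by setting both derivatives to zero."""
    a = math.sqrt(2.0 / math.pi)
    r = eta * a / gamma
    q = (eta / gamma) * a * r + eta ** 2 / (2.0 * gamma)
    return r, q


def generalization_error(r, q):
    """E_g = arccos(R / sqrt(Q)) / pi."""
    return math.acos(r / math.sqrt(q)) / math.pi
```

With weight decay the flow settles on the fixed point, and the residual generalization error there comes entirely from the η² noise term in the equation for Q.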



5. Stationary state 

Figure 2. Stationary generalization error (upper line) and training error (lower line) 
for perceptron (left) and Hebbian (right) learning rules. Learning rate $\eta = 1$ 
and decay $\gamma = 0.1$. Markers are simulation results of a single run with $N$ = £000 
input channels. Solid lines are theoretical predictions, obtained by solving (42). 

Many learning rules will not reach a stationary state that is independent of the initial 
conditions if weight decay is absent. Weight decay, or another type of 
constraint, may also be necessary to bound the length of the student vector. In the 
Hebbian case, for example, the student weights keep on growing in the direction of the 
perceived teacher, regardless of the size of the training error. In order for the student 
to reach a stationary state, we assume that the weight decay $\gamma$ is large enough to 
bound $R$ and $C$ and that the integrated response or susceptibility $g$ is finite: 

$$g = \lim_{t\to\infty}\Delta\sum_{t'}G_{tt'} < \infty. \tag{36}$$

This condition is known in the disordered systems literature as absence of anomalous 
response. We also assume that for sufficiently large $t$ the order parameters become 
time translation invariant: $R_t = R$, $G_{t+\tau,t} = G_\tau$, $D_{t+\tau,t} = D_\tau$. These assumptions 
are related to the replica symmetry ansatz in the replica equilibrium analyses [?]. We 
split the covariance kernel $D_\tau$ in a persistent part $d = \lim_{\tau\to\infty}D_\tau$ and a non-persistent 
part $\tilde D_\tau = D_\tau - d$. If $d$ exists, then 

$$d = \bar D = \lim_{T\to\infty}\frac{1}{T}\sum_{\tau=0}^{T}D_\tau.$$

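Given time translation invariance, one derives from equations (24) and (25) that 

$$R = \frac{\eta}{\gamma}\langle yF(x,y)\rangle \tag{37}$$

$$d = \lim_{\tau\to\infty}\lim_{t\to\infty}\frac{\eta}{\gamma}\langle F(x_{t+\tau},y)(x_t - yR)\rangle = \frac{\eta}{\gamma}\langle\bar F(\bar x - yR)\rangle \tag{38}$$

where overbars on $F$ and $x$ denote time averages. A relation found earlier involving 
the covariance, (21), now yields 

$$d = \alpha\,\overline{G\langle ff^{T}\rangle G^{T}} \tag{39}$$

(the overbar here denoting the persistent part), while the stationary value of $Q$ can 
be found from (26): 

$$Q = \frac{\eta}{\gamma}\langle F(x,y)\,x\rangle + \frac{\eta^2}{2\gamma}\langle F(x,y)^2\rangle. \tag{40}$$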

Figure 3. Stationary generalization error (upper line) and training error (lower line) 
for the perceptron learning rule with various decay rates $\gamma$. Training set size $\alpha = 4$. 

All the averages either involve a single time or two infinitely separated 
times. We lack an explicit expression for the single time probability distribution of $x_t$. 
The probability of $x_t$ is related to realisations of $x$ at previous times via the response 
function $G$. This makes the evaluation of the averages as hard as solving the dynamical 
equations themselves. The same problem exists in the field of (Ising) spin glasses and 
recurrent neural networks. In those cases where the stationary state is in detailed 
balance (e.g. when the dynamics are of gradient descent type and the system feels 
a Gaussian white noise) a fluctuation-dissipation relation connects the correlation $C$ 
and the response $G$. It is known for such systems that when calculating the persistent 
and single time parts of the correlation and the integrated response, the non-persistent 
parts can be chosen arbitrarily as long as the fluctuation-dissipation theorem (FDT) is 
obeyed. In particular one can set them to zero and take only the persistent parts and 
the integrated response into account. Although there are big differences between the 
learning perceptron discussed here and the aforementioned spin systems (for one, the 
learning rule $F$ does not have to be a gradient), we assume this decoupling property of 
persistent from non-persistent parts still holds‡. We replace the equilibrium distribution 
of $x$ generated by equation (22) by a distribution generated by a stochastic relation 
containing only the integrated response $g$ and random variables described by the 
persistent part of the covariance matrix $D$, the single-time correlation $Q$ and the 
student-teacher overlap $R$: 

$$x_t = yR + z + \tilde z_t + \frac{\eta}{\alpha}\,g\bar F, \tag{41}$$

where $y$, $z$ and $\tilde z_t$ are all independent Gaussian random variables with zero mean and 
covariances $\langle y^2\rangle = 1$, $\langle z^2\rangle = d$ and $\langle\tilde z_t\tilde z_{t'}\rangle = (Q - R^2 - d)\delta_{tt'}$. The average learning 
term $\bar F$ for a specific pattern with a certain $(y,z)$ can be expressed self-consistently 
as: 

$$\bar F_{yz} = \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}F(x_t,y) = \int\! d\tilde z\;p(\tilde z)\,F\Big(yR + z + \tilde z + \frac{\eta}{\alpha}\,g\bar F_{yz},\,y\Big) \tag{42}$$

For Hebbian learning one has $\bar F = \mathrm{sgn}(y)$, but in general (42) will be a transcendental 
equation, so one has to resort to numerical methods to solve it. Once $\bar F$ can be 
found for any point $(y,z)$, the remaining two independent Gaussian integrals over $y$ 
and $z$ can be evaluated to close equations (37) to (40). The remaining closed set 
can be solved numerically. Results for Hebbian and perceptron learning rules and 
various training set sizes are presented in figure 2. For the perceptron rule, the 
results shown in figure 3 compare $E_g$ and $E_t$ for different decay strengths. Perceptron 
results are independent of $\eta$. The theoretical predictions seem to be in almost perfect 
agreement with the simulations. Although no adatron results are shown, we expect 
that the proposed procedure is equally valid for this latter rule. Our method 
of calculation is only valid when $G$ is time translation invariant and the integrated 
response is bounded. For this to happen, we need the presence of a weight decay. The 
complication is that any decay, however small, will cause the adatron student weight 
vector to vanish. An alternative way of ensuring that the student ensemble reaches a 
stationary state that does not exhibit this problem is by constraining $\sigma$ to a sphere. 
This can be implemented by choosing $\gamma_t \propto (Q_t - 1)$. However, the adatron rule yields 
zero training error in this setup. This causes other problems in numerically evaluating 
the stationary state equations. 

‡ Note that a rigorous proof would first require the derivation of a non-equilibrium 
generalization of the FDT. 



5.1. Distribution of local fields 

As seen earlier, a big simplifying effect of the limit $\alpha \to \infty$ is to render the local fields 
$x$ and $y$ Gaussian. This happens irrespective of the learning rule involved. As soon 
as $\alpha < \infty$, the effect of the extra term $(\eta/\alpha)g\bar F$ in equation (41) sets in and the Gaussian 
form of the distribution evaporates for non-linear rules. The non-Gaussian form of 
the joint local field distribution has been discussed at length in [?], but equation (41) 
gives an intuitive idea of the origin of the deviations reported there. 

For a Hebbian learning rule, $F(x,y) = \mathrm{sgn}(y)$, the conditional distribution $p(x|y)$ 
remains Gaussian with variance $D = Q - R^2$, but will be shifted away from the mean $yR$ by the 
amount $\eta g\,\mathrm{sgn}(y)/\alpha$. An example with $\alpha = 1$ and $\gamma = 0.1$ is shown in figure 4. For 
the perceptron learning rule, this is no longer true. The random variables $y$ and $z$ are 
independently distributed Gaussian variables. From (42) we find that: 

$$\bar F_{yz} = \tfrac12\,\mathrm{sgn}(y) - \tfrac12\,\mathrm{erf}\Big(\frac{yR + z + \eta g\bar F_{yz}/\alpha}{\sqrt{2(D - d)}}\Big) \tag{43}$$

Samples of $\bar F_{yz}$ for $\alpha = 1$ and $\gamma = 0.1$ as a function of 
$z$ are shown in figure 5a for $y > 0$ (top) and $y < 0$ (bottom). The width of the 
sloping segment is $\eta g/\alpha$, while the size of $\sqrt{D - d}$ determines the rounding at the 
edges. The value of $\bar x$ corresponding to $y = 1$ as a function of the Gaussian disorder 
$z$ is drawn in figure 5b. For $y$ positive and roughly $z > -yR$, one has $\bar x = yR + z$, 
whereas for $z < -yR - \eta g/\alpha$ one finds $\bar x = yR + z + \eta g/\alpha$. For $z$ in the range 
$-yR - \eta g/\alpha < z < -yR$, we find $\bar x \approx 0$. In this particular example (using the same 
values for the order parameters as the graphs shown in figure 4c) $\sqrt{d} \approx 0.27$, so that 
the Gaussian measure confines $z$ close to the origin. Thus the resulting local field 
distribution is distinctly non-Gaussian as shown in figure 4. 
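
The self-consistent average learning term for the perceptron rule has to be found numerically for each pair (y, z). The sketch below does this by bisection, assuming the fixed-point relation takes the form F̄ = ½ sgn(y) - ½ erf((yR + z + c F̄) / (√2 w)), with coupling c = ηg/α and width w = √(D - d); this form is our reading of the self-consistency condition for the perceptron rule, and the function name and parameter values are illustrative.

```python
import math


def perceptron_fbar(y: float, z: float, r: float, coupling: float, width: float) -> float:
    """Solve fbar = rhs(fbar) for the perceptron rule by bisection.

    coupling = eta * g / alpha (assumed >= 0), width = sqrt(D - d) > 0.
    """
    s = 1.0 if y > 0 else -1.0

    def rhs(f: float) -> float:
        mean_field = y * r + z + coupling * f
        return 0.5 * s - 0.5 * math.erf(mean_field / (math.sqrt(2.0) * width))

    # fbar lies in [0, 1] for y > 0 and in [-1, 0] for y < 0. Since rhs is
    # non-increasing in f, f - rhs(f) is increasing and has a single root.
    lo, hi = (0.0, 1.0) if s > 0 else (-1.0, 0.0)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mid - rhs(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def mean_overlap(y, z, r, coupling, width):
    """Time-averaged student field: yR + z + coupling * fbar."""
    return y * r + z + coupling * perceptron_fbar(y, z, r, coupling, width)
```

With R = 1, coupling 0.5 and width 0.1, for instance, `perceptron_fbar(1.0, z, ...)` is close to 0 for z well above -1, close to 1 for z well below -1.5, and in between it adjusts itself so that `mean_overlap` stays near zero, which is the plateau described in the text.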



6. Conclusion 



Figure 4. Stationary local field distributions $p(x,y)$ for an infinite ($\alpha = \infty$) 
training set after perceptron learning (top), and two finite ($\alpha = 1$) training 
sets after Hebbian learning (middle) or perceptron learning (bottom). Left are 
simulation results, right are theoretical predictions in the form of contour plots. 
The infinite training set yields a joint Gaussian distribution, Hebbian learning gives 
only a conditionally Gaussian $p(x|y)$ and the perceptron rule deviates even further 
from the Gaussian shape. Learning rates are $\eta = 1$ and decay coefficients are 
$\gamma = 0.1$ in all three graphs. 

In this paper, we have studied the statics and dynamics of an ensemble of students 
learning on-line the classification of a large number of examples. This problem 
boils down to solving a large number of coupled stochastic difference equations, each 
corresponding to a single input channel. The situation is complicated by the existence 
of disorder in the form of the composition of the training set. Using the generating 
function method we have transformed this Markovian system of $N$ coupled equations 
in the limit of $N$ to infinity into an effective single pattern process. The price paid 
for this reduction is that the new process has noise which is correlated in time and 
the presence of a retarded self-interaction in the system, which make the dynamics 
non-Markovian. In principle it is possible to calculate the evolution of the system 
analytically, but in general it will be impossible to pursue this after the very first few 
time steps. However, the process can be solved numerically up to arbitrary precision. 

Figure 5. a) The relationship between $\bar F$ (see equation (43)) and $z$ for $y = 1$ 
(upper line) and $y = -1$ (lower line) for the stationary state of a perceptron with 
$\alpha = 1$ and $\gamma = 0.1$. The width of the sloping part is close to $\eta g/\alpha$, the abscissas 
are near $-yR$. b) The time average of $x_t$ as function of $z$, given $y = 1$. The 
abscissas of the thin straight lines are near $-yR - \eta g/\alpha$ and $-yR$. Due to the 
Gaussian measure of $z$ centered at the origin, the part close to the x-axis gives the 
most important contribution. 

Our calculation provides a solid basis for the further analytical study of linear 
rules. For non-linear rules the importance of our exact macroscopic dynamical 
equations is mainly in the insight they can give into the behaviour of different learning 
rules and the possibility they create to study and solve stationary states of both on- 
line and batch, gradient and non-gradient learning. Until now, the stationary states 
of these kinds of learning processes have only been directly accessible with tools from 
equilibrium statistical mechanics, requiring detailed balance. This confined analyses 
to batch gradient-descent learning. This restriction has now been lifted. From our 
macroscopic evolution equations we can extract the stationary state equations very 
easily if we assume time translation invariance and the absence of anomalous response. 
We have not yet addressed the issue of when this is likely to hold for on-line learning. 
To reduce the time-dependent order parameters like the student-autocorrelation and 
the student-response to a finite set of scalar order parameters, we apply a method 
we know from similar spin-glass problems based on the detachment of single-time 
and persistent order parameters from the non-persistent ones. The procedure consists 
of removing all non-persistent parts of the order parameters (except for the single 
time quantities), retaining only a small closed set of equations containing just four 
(Q,R,d,g) scalar macroscopic order parameters. Whether this last procedure is indeed 
exact, remains to be seen and will be the subject of a future study, but the numerical 
evidence clearly suggests that the underlying assumption holds. 